Overview

This project considers the burdens of poor air quality and its health consequences, and how they fall on different American communities. To explore and understand these relationships, I use daily site-level testing data from the EPA’s Air Quality System for fine particulate matter (PM2.5) from 2019, linked with 2018 data from the 5-year American Community Survey for the tract surrounding each testing site. In developing a plan for this project I met with Drs. Anil Vachani, Gary Weissman, and John P. Reilly, who provided insights on the causal pathway from pollution sources to specific EPA-measured pollutants to short- and long-term health outcomes; potential data sources and analytic approaches; the ways in which “natural experiments” related to major changes in pollutant generation have been used to compare the effects of perturbation from a steady state; and more. Materials for this project can be found in this GitHub repository.

Introduction

Very small particulate matter in the air can be inhaled deep into the lungs, causing respiratory and cardiovascular health problems such as asthma attacks, bronchitis, heart attacks, and more. Particulate matter is just one type of air pollution, but particulate matter pollution and other types of air pollution such as carbon monoxide, nitrogen oxides, and sulfur dioxide are often generated by similar sources and thus, highly correlated with one another. Particulate matter as a pollutant is classified as either fine or coarse, with fine being defined as less than 2.5 microns in diameter (PM2.5) while coarse is larger than 2.5 microns but less than 10 microns. Combustion is the major generator of PM2.5, especially from car exhaust, coal- or gas-fired power plants, or burning wood. 12 micrograms per cubic meter is considered the maximum acceptable concentration of PM2.5 in ambient air. This corresponds to the boundary between the “Good” (green) and “Moderate” (yellow) zones for the Air Quality Index (AQI) as updated in 2012.

Poor air quality can have a substantial negative influence on health and quality of life, but different households have different resources, incentives, and trade-offs to consider in determining where to live, especially since the sources of pollution like factories and highways can also be essential connections to work, family, and friends. As a result, we might expect that the relationship between place of residence and air quality is more complex than people with more wealth, income, and social power concentrating in places with better air quality. By considering many different community characteristics in building models to predict poor air quality, I aim to learn which factors are most strongly associated with air quality and how successfully models trained on community characteristics can predict poor air quality.

Methods

I begin by defining a function to create a data frame from the results of a GET call to the EPA’s Air Quality System API. I create objects with relevant reference lists and define a vector for the “lower 48” U.S. states and Washington, D.C. for later use.

# Retrieve environment variables with EPA AQS API credentials
api.email <- Sys.getenv("EPA_AQS_EMAIL")
api.key   <- Sys.getenv("EPA_AQS_KEY")

# Function to create a data frame based on an API call
download.AQS <- function(whichData,customParams) { 
  rootpath <- "https://aqs.epa.gov/data/api/"
  morepath <- paste0(rootpath,whichData)
  fullParams <- append(list(email = api.email, key = api.key),customParams)
  step1.json <- httr::GET(url = morepath, query = fullParams)
  step2.parsed <- jsonlite::fromJSON(httr::content(step1.json,"text"), simplifyVector = FALSE)
  step3.df <- tibble(Data = step2.parsed$Data) %>% unnest_wider(.,Data)
  return(step3.df)
}

# Fetch documentation for daily data summaries, state code list
dailyData.dictionary <- download.AQS("metaData/fieldsByService",list(service = "dailyData"))
list.states <- download.AQS("list/states",list())

# Omitting state codes other than 48 states + D.C.
list.states.lower48 <- list.states %>% filter(!(code %in% c("02","15","66","72","78","80","CC")))
list.states.lower48
lower48.codes <- as.vector(as.matrix(list.states.lower48[,1]))

To manage downloading daily PM2.5 data for each state for all of 2019, I define another function to loop through the state FIPS codes for given date parameters. Executing this step with 1 call per month of data takes roughly 90 minutes, so I save the resulting data frame and load it in the chunk below.

# Pull daily PM2.5 (parameter 88101) data for these states, with arguments for date range
get.daily.lower48 <- function(b,e){
collector <- data.frame()
for (s in seq_along(lower48.codes)) {
  newstate <- download.AQS("dailyData/byState", list(param="88101", bdate=b, edate=e, state=list.states.lower48[s,1]))
  collector <- bind_rows(collector,newstate)
  }
  return(collector)
}

# Run in monthly batches
dailyPM2.5_201901 <- get.daily.lower48("20190101","20190131")
dailyPM2.5_201902 <- get.daily.lower48("20190201","20190228")
dailyPM2.5_201903 <- get.daily.lower48("20190301","20190331")
dailyPM2.5_201904 <- get.daily.lower48("20190401","20190430")
dailyPM2.5_201905 <- get.daily.lower48("20190501","20190531")
dailyPM2.5_201906 <- get.daily.lower48("20190601","20190630")
dailyPM2.5_201907 <- get.daily.lower48("20190701","20190731")
dailyPM2.5_201908 <- get.daily.lower48("20190801","20190831")
dailyPM2.5_201909 <- get.daily.lower48("20190901","20190930")
dailyPM2.5_201910 <- get.daily.lower48("20191001","20191031")
dailyPM2.5_201911 <- get.daily.lower48("20191101","20191130")
dailyPM2.5_201912 <- get.daily.lower48("20191201","20191231")
 
# Concatenate monthly batch files into a single file for 2019
all.2019.dailyPM2.5 <- bind_rows( dailyPM2.5_201901
                                 ,dailyPM2.5_201902
                                 ,dailyPM2.5_201903
                                 ,dailyPM2.5_201904
                                 ,dailyPM2.5_201905
                                 ,dailyPM2.5_201906
                                 ,dailyPM2.5_201907
                                 ,dailyPM2.5_201908
                                 ,dailyPM2.5_201909
                                 ,dailyPM2.5_201910
                                 ,dailyPM2.5_201911
                                 ,dailyPM2.5_201912)

# Save full 2019 raw file
saveRDS(all.2019.dailyPM2.5, file = "all2019dailyPM25.RDS")

I test combinations of testing site location identifiers and add a single site_id variable to the 2019 PM2.5 data set by concatenating location identifiers. I also create a data frame with one observation per testing site.

# Load the full 2019 raw file
all.2019.dailyPM2.5 <- readRDS("all2019dailyPM25.RDS")
str(all.2019.dailyPM2.5)

# Create a file of all unique combinations of geographic indicators for testing sites
testing.sites <- all.2019.dailyPM2.5 %>%
                      dplyr::select(state_code, state, 
                             county_code, county,
                             city, cbsa_code, cbsa,
                             site_number, local_site_name, site_address,
                             latitude, longitude) %>%
                      unique() %>%
                      arrange(state_code,county_code,site_number)
dim(testing.sites)
# 965 combinations

# Create a single unique ID column for site
testing.sites %>% dplyr::select(site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,county_code,site_number) %>% unique() %>% dim()
# Only 253 unique site numbers. 770 state-site combos. 
# State + county + site number *is* unique (965 combos)

clean.2019.dailyPM2.5 <- all.2019.dailyPM2.5 %>%
  mutate(site_id = paste0(state_code,county_code,site_number))

Using the tigris::tracts function, I retrieve the geometries, GEOIDs, and other spatial information for census tracts and join each AQS testing site to the data for the tract within which it lies, based on the latitude and longitude from AQS. I looping through the state FIPS codes one at a time and stack the results into a single crosswalk file. This step is time-consuming so I save the resulting file and load it in the next step.

# Prepare for spatial merge. Create sf file from testing site coordinates
testing.sites.sf.points <- st_as_sf(testing.sites, coords = c('longitude','latitude'), crs = 4269)

# Download census tract sf shapefiles, one state at a time.
# Perform spatial joins one state at a time to avoid having to build the national tract shapefile.
for (s in seq_along(lower48.codes)) {
  newstate <- lower48.codes[s]
  newtracts <- tracts(newstate)
  newjoin <- st_join(newtracts, testing.sites.sf.points, join = st_contains, left=FALSE)
  if (s==1){
    testing.sites.tracts <- newjoin
  } else {
    testing.sites.tracts <- bind_rows(testing.sites.tracts,newjoin)
  }
}

# Save file with testing sites <> census tract
saveRDS(testing.sites.tracts, file = "testing_sites_tracts.RDS")

From tidycensus I now retrieve the variable list for the 2018 5-year American Community Survey. After reviewing the concepts from those variables as well as the questionnaire, I create an object with a set of variables to test for associations with the AQS data.

testing.sites.tracts <- readRDS("testing_sites_tracts.RDS")
head(testing.sites.tracts)

# Load a data frame with all 5-year ACS variables
census.1yr.vars <- load_variables(dataset = "acs5",year = 2018)
head(census.1yr.vars)

# Simplify the variable list to make it easier to review the major concepts represented
census.1yr.concept.counts <- 
  census.1yr.vars %>% 
  mutate(root.concept = str_split_fixed(concept," BY",2)[,1]) %>%
  mutate(root.concept = str_split_fixed(root.concept," \\(",2)[,1]) %>% 
      # Double escape needed to match open parenthesis!
  mutate(root.name = substring(name,1,6)) %>%
  group_by(root.name,root.concept) %>%
  summarise(obs=n(), .groups="keep") %>%
  arrange(root.name,-obs)

# View(sort(unique(census.1yr.concept.counts$root.concept)))

# After reviewing the variable list, create an object with the ones of interest
final.acs.var.list <- census.1yr.vars %>% filter(name %in% c(
     "B01002_001" # median age
    ,"B03002_003","B03002_001" # Hon-Hisp-White-alone // Denom
    ,"B02009_001","B02001_001" # Black race, alone or in combination // Denom
    ,"B06008_003","B06008_001" # Now married // Denom
    ,"B23007_002","B23007_001" # Children under 18yrs in household // Denom
    ,"B15003_022","B15003_023","B15003_024","B15003_025","B15003_001" # Bachelors // Masters // Professional // Doctorate // Denom
    ,"B23022_027","B23022_026","B23022_003","B23022_002" # Worked in past 12 months-Female // Denom-Female // p12-Male // Denom-Male
    ,"B21001_002","B21001_001" # Veteran // Denom
    ,"B17001_002","B17001_001" # Past 12mo income at or below poverty level, Denom
    ,"B22010_002","B22010_001" # Received SNAP in past 12mo, Denom
    ,"B22008_001"  # Median household income, past 12mo
    ,"B25008_003","B25008_001" # Renter occupied, Denom
    ,"B25024_002","B25024_003","B25024_001" # Single-fam, detached // single-fam, attached // Denom
    ,"B25064_001" # Median gross rent
    ,"B25088_002" # Median monthly owner costs for households with a mortgage
    ,"B08126_004","B08126_002","B08126_001" # INDUSTRY: Manufacturing // Agriculture, fishing, mining // Denom
    ,"B08006_003","B08006_001" # Drove to work alone // Denom
    ,"B08013_001" # Aggregate travel time to work in minutes
  )) %>% arrange(name)

final.acs.var.list

Once more I loop through the state FIPS codes, retrieving the ACS variables of interest by tract as a flat file using tidycensus::get_acs and preserving the data only for the census tracts that contain AQS testing sites. As this is another slow step, I save the resulting data frame.

head(testing.sites.tracts)

# Keep only what is needed to make the ACS calls in a small data frame
tst.minimal <- testing.sites.tracts %>% 
                mutate(site_id = paste0(state_code,county_code,site_number)) %>%
                dplyr::select(site_id, GEOID, STATEFP, COUNTYFP)
st_geometry(tst.minimal) <- NULL

# Loop through existing list of state codes
lower48.codes

# Make ACS calls one state at a time, only for counties with AQS testing sites
# Preserve estimate columns, ditch MOE
# Join to testing site list
# Stack ACS data together in a single data frame

acs.loop <- function(){
  for (st in seq_along(lower48.codes)) {
    
    newstate <- lower48.codes[st]
    
    county.list <- tst.minimal %>% 
                  filter(STATEFP==newstate) %>% 
                  dplyr::select(COUNTYFP) %>%
                  arrange(COUNTYFP) %>%
                  as.matrix() %>% as.vector() %>% as.list()
    
    results.acs <- get_acs( geography = "tract"
                          ,variables=as.list(final.acs.var.list$name)
                          ,state = newstate
                          ,county = county.list
                          ,output = "wide")
    
    smaller.acs <- results.acs %>% dplyr::select(GEOID,matches(".*_.*E$"))
  
    linked.acs <- tst.minimal %>% 
                filter(STATEFP==newstate) %>%
                left_join(.,smaller.acs,by="GEOID")
    
    if (st==1){
      collect <- linked.acs
    } else {
      collect <- bind_rows(collect,linked.acs)
    }
  }
  return(collect)
}

testing.sites.acs <- acs.loop()
testing.sites.acs

saveRDS(testing.sites.acs, file = "testing_sites_acs.RDS")

With all the ACS data I need, I can create clean versions of the variables to be included in the analysis. I exclude tracts that do not have complete data for all 19 ACS-based variables or with fewer than 500 respondents in the sample (using the denominator from the set of race variables).

testing.sites.acs <- readRDS("testing_sites_acs.RDS")
head(testing.sites.acs)

# Create percentage variables and scaled medians/totals to use in modeling
testing.sites.acs.ready <- testing.sites.acs %>%
  dplyr::rename( age.median       = B01002_001E
                ,income.hh.median = B22008_001E 
                ,rent.median      = B25064_001E
                ,home.pmt.median  = B25088_002E
                ,commute.time.agg = B08013_001E) %>%
  mutate( race.black.any          = 100*B02009_001E/B02001_001E
         ,ethn.nhisp.white.alone  = 100*B03002_003E/B03002_001E
         ,married                 = 100*B06008_003E/B06008_001E
         ,kids                    = 100*B23007_002E/B23007_001E
         ,bachelor.plus           = 100*(B15003_022E + B15003_023E + B15003_024E + B15003_025E)/B15003_001E
         ,veteran                 = 100*B21001_002E/B21001_001E
         ,manufacturing           = 100*B08126_004E/B08126_001E
         ,farm.fish.mining        = 100*B08126_002E/B08126_001E
         ,commutes.car.alone      = 100*B08006_003E/B08006_001E
         ,renter                  = 100*B25008_003E/B25008_001E
         ,single.fam.home         = 100*(B25024_002E + B25024_003E)/B25024_001E
         ,worked.12mo             = 100*(B23022_027E + B23022_003E)/(B23022_026E + B23022_002E) 
         ,poverty.12mo            = 100*B17001_002E/B17001_001E
         ,snap.12mo               = 100*B22010_002E/B22010_001E
         ,denominator             = B02001_001E) %>%
  dplyr::select(-matches("^B[012].*"))

head(testing.sites.acs.ready)
summary(testing.sites.acs.ready)
# testing.sites.acs.ready %>% dplyr::filter(denominator < 1000) %>% View()
# testing.sites.acs.ready %>% dplyr::filter(denominator < 500 | !complete.cases(.)) %>% View()

length(unique(testing.sites.acs.ready$GEOID))
# 953 - not unique by tract

# 34 tracts to exclude based on incomplete data and total denominator
# Bring the aggregates and medians to a comparable scale
testing.sites.acs.complete <- 
  testing.sites.acs.ready %>% 
  mutate(commute.time.agg  = commute.time.agg/10000
         ,income.hh.median = income.hh.median/10000
         ,rent.median      = rent.median/100
         ,home.pmt.median  = home.pmt.median/100
         ) %>%
  dplyr::rename(commute.time.agg.10k  = commute.time.agg
                ,income.hh.median.10k = income.hh.median
                ,rent.median.100      = rent.median
                ,home.pmt.median.100  = home.pmt.median) %>%
  dplyr::filter(denominator >= 500 & complete.cases(.))

summary(testing.sites.acs.complete)
length(unique(testing.sites.acs.complete$GEOID))

I join the census tract GEOIDs to the daily PM2.5 data, and clean and summarize the PM2.5 data by tract. Tracts with fewer than 50 days of valid data are excluded.

summary(clean.2019.dailyPM2.5)

# Create a file of site IDs with GEOIDs that have complete ACS data.
site_id.GEOID.xwalk <- testing.sites.acs.complete %>%
                        dplyr::select(site_id,GEOID) %>%
                        unique()

GEOID.dailyPM2.5 <- inner_join( site_id.GEOID.xwalk
                               ,clean.2019.dailyPM2.5
                               ,by="site_id")

dim(GEOID.dailyPM2.5)
# 1,396,734 rows

dim(unique(GEOID.dailyPM2.5[,c("GEOID","date_local")]))
# 243,809 tract-days

table(GEOID.dailyPM2.5$event_type, useNA="always")
# Included, Excluded, None
table(GEOID.dailyPM2.5$validity_indicator, useNA="always")
table(GEOID.dailyPM2.5$validity_indicator, GEOID.dailyPM2.5$event_type, useNA="always")
# Y, N. 4282 x N
# Excluding validity_indicator == "N" does not get rid of a meaningful % of events
# GEOID.dailyPM2.5 %>% dplyr::filter(validity_indicator == "N") %>% View()

summary(GEOID.dailyPM2.5$arithmetic_mean)

# Create preliminary outcome variables
GEOID.daily.summary <-
  GEOID.dailyPM2.5 %>%
  dplyr::filter(validity_indicator=="Y") %>%
  mutate(arithmetic_mean = case_when(arithmetic_mean < 0 ~ 0, TRUE ~ arithmetic_mean)) %>%
  group_by(GEOID,date_local) %>%
  summarise(mean = mean(arithmetic_mean),.groups = "keep")

# table(GEOID.daily.summary$any_event)
summary(GEOID.daily.summary$mean)

GEOID.AQI.outcomes <-
  GEOID.daily.summary %>%
  mutate(ungreen = case_when(mean > 12.0 ~ 1L, TRUE ~ 0L)) %>%
  group_by(GEOID) %>%
  summarise( pm2.5_days         = n()
            ,pm2.5_p50          = quantile(mean, 0.50)
            ,pm2.5_ungreen_days = sum(ungreen)
            ,.groups="keep")

head(GEOID.AQI.outcomes)
summary(GEOID.AQI.outcomes)

# How many days of valid data do we have per tract?
table(GEOID.AQI.outcomes$pm2.5_days)

# Create final PM2.5 outcome variables
AQI.outcomes.final <-
  GEOID.AQI.outcomes %>%
  dplyr::filter(pm2.5_days >= 50) %>%
  mutate( pm2.5_pct_ungreen    = 100 * pm2.5_ungreen_days / pm2.5_days
         ,pm2.5_flag10_ungreen = case_when(pm2.5_ungreen_days * 10 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ,pm2.5_flag20_ungreen = case_when(pm2.5_ungreen_days *  5 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ) %>%
  mutate( pm2.5_flag10_ungreen = factor(pm2.5_flag10_ungreen
                                        ,levels = c(0,1)
                                        ,labels = c("0-10%",">10%"))
         ,pm2.5_flag20_ungreen = factor(pm2.5_flag20_ungreen
                                        ,levels = c(0,1)
                                        ,labels = c("0-20%",">20%"))) %>%
  dplyr::select(GEOID, pm2.5_p50, pm2.5_pct_ungreen, pm2.5_flag10_ungreen, pm2.5_flag20_ungreen)

# Review created outcome variables
head(AQI.outcomes.final)
summary(AQI.outcomes.final$pm2.5_p50)
summary(AQI.outcomes.final$pm2.5_pct_ungreen)
table(AQI.outcomes.final$pm2.5_flag10_ungreen)
table(AQI.outcomes.final$pm2.5_flag20_ungreen)
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag10_ungreen==">10%") %>% summary()
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag20_ungreen==">20%") %>% summary()

With both predictors and outcomes summarized by census tract, the ACS and AQS files can now be joined to create the final data set.

str(AQI.outcomes.final)
str(testing.sites.acs.complete)

# Drop unnecessary geographic variables from the ACS file
acs.joinready <- 
  testing.sites.acs.complete %>%
  dplyr::select(-site_id, -STATEFP, -COUNTYFP) %>%
  unique()

dim(acs.joinready)
length(unique(acs.joinready$GEOID))
# 920 rows, unique on GEOID

# Join PM2.5 and ACS data by tract on GEOID
AQI.ACS <- inner_join(AQI.outcomes.final,acs.joinready,by="GEOID")
head(AQI.ACS)

# Prepare the geography/geometry file for linkage
str(testing.sites.tracts)
geom.joinready <-
  testing.sites.tracts %>%
  dplyr::select(GEOID,NAMELSAD,STATEFP,state,COUNTYFP,county,city,cbsa,cbsa_code,geometry) %>%
  unique()
dim(geom.joinready)
length(unique(geom.joinready$GEOID))
geom.joinready %>% group_by(GEOID) %>% summarise(obs=n(),.groups="keep") %>% arrange(-obs) %>% head()
geom.joinready %>% dplyr::filter(GEOID=="06079012304")

# Hard-code one value to be able to keep city attribute unique by GEOID
geom.joinready2 <- 
  geom.joinready %>%
  mutate(city = case_when(GEOID=="06079012304" ~ "Arroyo Grande", TRUE ~ city)) %>%
  unique()

dim(geom.joinready2)
length(unique(geom.joinready2$GEOID))
head(geom.joinready2)

# Join geography/geometry file with AQS outcomes and ACS variables
AQI.ACS.geom <- inner_join(geom.joinready2,AQI.ACS,by="GEOID") %>% group_by()
head(AQI.ACS.geom)
class(AQI.ACS.geom)

Results

The continuous PM2.5 outcomes show geographic variability. Some patterns are evident; very rural tracts appear to have lower levels of PM2.5 compared to urban centers; the “Rust Belt” and southern California seem to have higher PM2.5 generally; but there do not appear to be large areas within which the analyzed tracts have very consistent values.

Continuous PM2.5 outcomes by tract, mapped:

AQI.ACS.geom.WGS84 <- st_transform(AQI.ACS.geom, 4326) # FROM 4269 TO 4326 to cooperate with leaflet

palette <- colorNumeric("viridis", NULL, reverse=TRUE)

# Median PM2.5 by tract - interactive
popup_msg <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
                   ,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
                   ,"<br>Median PM2.5: ",round(AQI.ACS.geom.WGS84$pm2.5_p50,digits=2))

leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_p50)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg) %>%      
  addTiles() %>%
  addLegend("bottomright",         
            pal=palette,      
            values=~pm2.5_p50,   
            title = 'PM2.5',  
            opacity = 1) %>%    
  addScaleBar()
# % of days above 12 mcg/m3 by tract - interactive
popup_msg2 <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
                   ,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
                   ,"<br>Reached 12mcg/m3 on ",round(AQI.ACS.geom.WGS84$pm2.5_pct_ungreen,digits=1),"% of days")

leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_pct_ungreen)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg2) %>%      
  addTiles() %>%
  addLegend("bottomright",         
            pal=palette,      
            values=~pm2.5_pct_ungreen,   
            title = '% days of 12+ mcg/m3 PM2.5',  
            opacity = 1) %>%    
  addScaleBar()

Regression results, using all 19 ACS-based predictor variables:

# Run linear regression for continuous outcomes
lm.p50 <- lm(pm2.5_p50 ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom)

lm.pct.ungreen <- lm(pm2.5_pct_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom)

# Run logistic regression for binary outcomes
logit.flag10.ungreen <- glm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,family = binomial())

logit.flag20.ungreen <- glm(pm2.5_flag20_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,family = binomial())

# Extract p-values by predictor variable from regression results into a summary data frame.
lm.p50.pval.df <- data.frame(acs.var = names(summary(lm.p50)$coefficients[-1,4])
                             ,lm.p50.pval  = summary(lm.p50)$coefficients[-1,4])

lm.pct.ungreen.pval.df <- data.frame(acs.var = names(summary(lm.pct.ungreen)$coefficients[-1,4])
                                    ,lm.pct.ungreen.pval   = summary(lm.pct.ungreen)$coefficients[-1,4])

logit.flag10.ungreen.pval.df <- data.frame(acs.var = names(summary(logit.flag10.ungreen)$coefficients[-1,4])
                                    ,logit.flag10.ungreen.pval = summary(logit.flag10.ungreen)$coefficients[-1,4])

logit.flag20.ungreen.pval.df <- data.frame(acs.var = names(summary(logit.flag20.ungreen)$coefficients[-1,4])
                                    ,logit.flag20.ungreen.pval = summary(logit.flag20.ungreen)$coefficients[-1,4])

# Join p-value data frames together by ACS-based variable name. 
# Sort ascending by p-value for median PM2.5 regression
pval1 <- inner_join(lm.p50.pval.df,lm.pct.ungreen.pval.df, by="acs.var")
pval2 <- inner_join(pval1,logit.flag10.ungreen.pval.df, by="acs.var")
pval3 <- inner_join(pval2,logit.flag20.ungreen.pval.df, by="acs.var") %>% arrange(lm.p50.pval)

# Return results

summary.lm(lm.p50)
# Adjusted R-squared:  0.2364 
summary(lm.pct.ungreen)
# Adjusted R-squared:  0.1353 
summary(logit.flag10.ungreen)
# AIC: 1118.6
summary(logit.flag20.ungreen)
# AIC: 842.87

pval3
## 
## Call:
## lm(formula = pm2.5_p50 ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3401 -1.0163 -0.0799  1.0319  6.7636 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            10.281525   1.188349   8.652  < 2e-16 ***
## age.median             -0.030768   0.012186  -2.525 0.011746 *  
## commute.time.agg.10k    0.011560   0.020144   0.574 0.566215    
## income.hh.median.10k   -0.020663   0.061624  -0.335 0.737473    
## rent.median.100         0.013211   0.027278   0.484 0.628282    
## home.pmt.median.100    -0.015226   0.018848  -0.808 0.419410    
## race.black.any          0.015403   0.003726   4.134 3.90e-05 ***
## ethn.nhisp.white.alone -0.007592   0.003167  -2.398 0.016705 *  
## married                 0.004083   0.008184   0.499 0.617959    
## kids                   -0.008251   0.006990  -1.180 0.238167    
## bachelor.plus           0.005070   0.006317   0.803 0.422426    
## veteran                -0.046590   0.018786  -2.480 0.013321 *  
## manufacturing           0.031802   0.008779   3.623 0.000308 ***
## farm.fish.mining       -0.046542   0.010051  -4.630 4.19e-06 ***
## commutes.car.alone      0.019526   0.005590   3.493 0.000501 ***
## renter                  0.004082   0.005638   0.724 0.469257    
## single.fam.home        -0.002814   0.003933  -0.716 0.474407    
## worked.12mo            -0.041731   0.008345  -5.000 6.89e-07 ***
## poverty.12mo           -0.011233   0.009332  -1.204 0.229001    
## snap.12mo              -0.002669   0.007966  -0.335 0.737633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.592 on 889 degrees of freedom
## Multiple R-squared:  0.2524, Adjusted R-squared:  0.2364 
## F-statistic:  15.8 on 19 and 889 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = pm2.5_pct_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.932  -5.907  -1.368   4.229  47.795 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            27.820524   6.444245   4.317 1.76e-05 ***
## age.median             -0.096365   0.066081  -1.458 0.145117    
## commute.time.agg.10k    0.016552   0.109237   0.152 0.879597    
## income.hh.median.10k   -0.211577   0.334180  -0.633 0.526815    
## rent.median.100        -0.005321   0.147927  -0.036 0.971314    
## home.pmt.median.100    -0.061119   0.102213  -0.598 0.550021    
## race.black.any          0.022965   0.020205   1.137 0.256017    
## ethn.nhisp.white.alone -0.050822   0.017172  -2.960 0.003162 ** 
## married                 0.013969   0.044380   0.315 0.753017    
## kids                   -0.008858   0.037905  -0.234 0.815273    
## bachelor.plus           0.021526   0.034253   0.628 0.529887    
## veteran                -0.136190   0.101874  -1.337 0.181615    
## manufacturing           0.172555   0.047607   3.625 0.000306 ***
## farm.fish.mining       -0.134663   0.054507  -2.471 0.013677 *  
## commutes.car.alone      0.025827   0.030314   0.852 0.394454    
## renter                  0.029989   0.030575   0.981 0.326949    
## single.fam.home         0.022962   0.021327   1.077 0.281930    
## worked.12mo            -0.152661   0.045256  -3.373 0.000775 ***
## poverty.12mo           -0.061633   0.050605  -1.218 0.223584    
## snap.12mo               0.021525   0.043200   0.498 0.618417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.632 on 889 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1353 
## F-statistic: 8.475 on 19 and 889 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## glm(formula = pm2.5_flag10_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, family = binomial(), data = AQI.ACS.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4284  -1.0623   0.5540   0.9632   1.7815  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             5.0426702  1.7523543   2.878  0.00401 ** 
## age.median             -0.0306630  0.0171606  -1.787  0.07397 .  
## commute.time.agg.10k   -0.0077921  0.0271376  -0.287  0.77401    
## income.hh.median.10k    0.0237597  0.0839018   0.283  0.77704    
## rent.median.100        -0.0340125  0.0371454  -0.916  0.35984    
## home.pmt.median.100    -0.0275504  0.0263048  -1.047  0.29494    
## race.black.any          0.0151165  0.0059452   2.543  0.01100 *  
## ethn.nhisp.white.alone -0.0108105  0.0044240  -2.444  0.01454 *  
## married                 0.0052513  0.0115284   0.456  0.64874    
## kids                   -0.0061786  0.0099806  -0.619  0.53588    
## bachelor.plus           0.0118970  0.0087472   1.360  0.17380    
## veteran                -0.0267337  0.0259119  -1.032  0.30221    
## manufacturing           0.0727012  0.0138298   5.257 1.47e-07 ***
## farm.fish.mining       -0.0343879  0.0146051  -2.355  0.01855 *  
## commutes.car.alone     -0.0048214  0.0079629  -0.605  0.54486    
## renter                  0.0069986  0.0079342   0.882  0.37773    
## single.fam.home        -0.0008233  0.0056277  -0.146  0.88369    
## worked.12mo            -0.0349466  0.0123114  -2.839  0.00453 ** 
## poverty.12mo           -0.0245880  0.0134790  -1.824  0.06813 .  
## snap.12mo               0.0059546  0.0114358   0.521  0.60258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1230.0  on 908  degrees of freedom
## Residual deviance: 1078.6  on 889  degrees of freedom
## AIC: 1118.6
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Call:
## glm(formula = pm2.5_flag20_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, family = binomial(), data = AQI.ACS.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3347  -0.6637  -0.5104  -0.3763   2.4371  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             7.962e-01  1.923e+00   0.414  0.67891   
## age.median             -2.691e-03  2.082e-02  -0.129  0.89717   
## commute.time.agg.10k   -1.050e-02  3.917e-02  -0.268  0.78865   
## income.hh.median.10k   -2.337e-01  1.215e-01  -1.923  0.05451 . 
## rent.median.100         4.126e-02  5.053e-02   0.816  0.41426   
## home.pmt.median.100     1.835e-05  3.366e-02   0.001  0.99957   
## race.black.any          1.417e-03  5.585e-03   0.254  0.79970   
## ethn.nhisp.white.alone -9.910e-03  5.076e-03  -1.952  0.05092 . 
## married                 1.540e-03  1.376e-02   0.112  0.91089   
## kids                    7.767e-03  1.155e-02   0.673  0.50125   
## bachelor.plus           1.035e-02  1.109e-02   0.933  0.35085   
## veteran                -1.466e-02  3.223e-02  -0.455  0.64912   
## manufacturing           9.666e-03  1.415e-02   0.683  0.49446   
## farm.fish.mining       -1.214e-02  1.786e-02  -0.679  0.49683   
## commutes.car.alone      1.179e-02  9.554e-03   1.234  0.21711   
## renter                  9.602e-04  9.174e-03   0.105  0.91664   
## single.fam.home         8.781e-03  6.445e-03   1.362  0.17310   
## worked.12mo            -3.752e-02  1.316e-02  -2.850  0.00437 **
## poverty.12mo           -5.823e-03  1.521e-02  -0.383  0.70187   
## snap.12mo              -3.912e-03  1.249e-02  -0.313  0.75423   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 870.13  on 908  degrees of freedom
## Residual deviance: 802.87  on 889  degrees of freedom
## AIC: 842.87
## 
## Number of Fisher Scoring iterations: 5
## 
##                   acs.var  lm.p50.pval lm.pct.ungreen.pval logit.flag10.ungreen.pval logit.flag20.ungreen.pval
## 1             worked.12mo 6.892815e-07        0.0007749118              4.531875e-03               0.004367877
## 2        farm.fish.mining 4.193095e-06        0.0136765637              1.854684e-02               0.496829902
## 3          race.black.any 3.901449e-05        0.2560166477              1.100201e-02               0.799699586
## 4           manufacturing 3.082985e-04        0.0003058825              1.465384e-07               0.494461052
## 5      commutes.car.alone 5.012410e-04        0.3944535502              5.448595e-01               0.217107780
## 6              age.median 1.174578e-02        0.1451167688              7.396545e-02               0.897173725
## 7                 veteran 1.332100e-02        0.1816154439              3.022059e-01               0.649115596
## 8  ethn.nhisp.white.alone 1.670511e-02        0.0031615614              1.454114e-02               0.050915996
## 9            poverty.12mo 2.290010e-01        0.2235836814              6.812551e-02               0.701870729
## 10                   kids 2.381670e-01        0.8152731603              5.358795e-01               0.501254312
## 11    home.pmt.median.100 4.194097e-01        0.5500206039              2.949380e-01               0.999565046
## 12          bachelor.plus 4.224260e-01        0.5298874824              1.738009e-01               0.350849549
## 13                 renter 4.692571e-01        0.3269485271              3.777317e-01               0.916636934
## 14        single.fam.home 4.744072e-01        0.2819304914              8.836892e-01               0.173100905
## 15   commute.time.agg.10k 5.662155e-01        0.8795974016              7.740095e-01               0.788645632
## 16                married 6.179593e-01        0.7530172352              6.487405e-01               0.910894353
## 17        rent.median.100 6.282817e-01        0.9713137300              3.598450e-01               0.414260334
## 18   income.hh.median.10k 7.374727e-01        0.5268151364              7.770352e-01               0.054510257
## 19              snap.12mo 7.376328e-01        0.6184171875              6.025790e-01               0.754228687

Repeat linear regression for median PM2.5 using only the variables that were significant at p<.05 when all 19 were included:

# Repeat p50 regression with top 8. Do all remain significant? Does R-squared improve?
lm.p50.top8 <- lm(pm2.5_p50 ~ 
               age.median + race.black.any + ethn.nhisp.white.alone + worked.12mo + 
               veteran + manufacturing + farm.fish.mining + commutes.car.alone 
              ,data = AQI.ACS.geom)
summary(lm.p50.top8)
## 
## Call:
## lm(formula = pm2.5_p50 ~ age.median + race.black.any + ethn.nhisp.white.alone + 
##     worked.12mo + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2034 -1.0121 -0.0794  1.0368  6.8606 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.833839   0.621228  14.220  < 2e-16 ***
## age.median             -0.022178   0.008605  -2.577 0.010118 *  
## race.black.any          0.014055   0.003164   4.441 1.01e-05 ***
## ethn.nhisp.white.alone -0.007163   0.002729  -2.624 0.008828 ** 
## worked.12mo            -0.030674   0.006328  -4.847 1.48e-06 ***
## veteran                -0.049908   0.018060  -2.763 0.005836 ** 
## manufacturing           0.028004   0.008086   3.463 0.000559 ***
## farm.fish.mining       -0.050607   0.008711  -5.809 8.70e-09 ***
## commutes.car.alone      0.018190   0.004403   4.131 3.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.59 on 900 degrees of freedom
## Multiple R-squared:  0.2445, Adjusted R-squared:  0.2378 
## F-statistic:  36.4 on 8 and 900 DF,  p-value: < 2.2e-16
# Adjusted R-squared:  0.2378 

Bivariate plots of median PM2.5 and 10%-of-days outcomes for the 8 predictor variables with a significant (p<.05) association in at least 1 regression model:

# worked.12mo
ggplot(data = AQI.ACS.geom, aes(x = worked.12mo, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% Worked in Past 12 Months") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = worked.12mo)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% Worked in Past 12 Months") 

# manufacturing
ggplot(data = AQI.ACS.geom, aes(x = manufacturing, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "blue", method = "lm") + 
  xlab("% of Workers in Manufacturing Industry") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = manufacturing)) +
  geom_violin(fill = "springgreen1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers in Manufacturing Industry") 

# farm.fish.mining
ggplot(data = AQI.ACS.geom, aes(x = farm.fish.mining, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Workers in Agriculture, Forestry,\nFishing & Hunting, and Mining Industry Category") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = farm.fish.mining)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers in Agriculture, Forestry,\nFishing & Hunting, and Mining Industry Category")
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm): collapsing to unique 'x' values

# ethn.nhisp.white.alone
ggplot(data = AQI.ACS.geom, aes(x = ethn.nhisp.white.alone, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "blue", method = "lm") + 
  xlab("% of Respondents Reporting\nNon-Hispanic Ethnicity and White Race Only") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = ethn.nhisp.white.alone)) +
  geom_violin(fill = "springgreen1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Respondents Reporting\nNon-Hispanic Ethnicity and White Race Only")

# race.black.any
ggplot(data = AQI.ACS.geom, aes(x = race.black.any, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Respondents Reporting Black Race\n(with or without other categories)") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = race.black.any)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Respondents Reporting Black Race\n(with or without other categories)")

# commutes.car.alone
ggplot(data = AQI.ACS.geom, aes(x = commutes.car.alone, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "blue", method = "lm") + 
  xlab("% of Workers Commuting by Car, Alone") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = commutes.car.alone)) +
  geom_violin(fill = "springgreen1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers Commuting by Car, Alone") 

# age.median
ggplot(data = AQI.ACS.geom, aes(x = age.median, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("Median Age (Years)") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = age.median)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("Median Age (Years)")

# veteran
ggplot(data = AQI.ACS.geom, aes(x = veteran, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "blue", method = "lm") + 
  xlab("% Veterans") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = veteran)) +
  geom_violin(fill = "springgreen1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% Veterans")

In considering the bivariate plots, several of the associations we noted from the multivariate models are apparent. Better air quality (lower PM2.5) is associated with:

Other associations with better air quality for which the bivariate plots are less convincing:

To explore how well these factors perform at predicting tracts with higher PM2.5, I perform 5-fold cross-validation to generate and evaluate predictions for the binary outcome for whether a tract had >10% of days above 12 mcg/m3. Four (4) approaches were assessed:

# Generate a column containing the 5 groups for 5-fold cross-validation
set.seed(1504492)
kgroup <- data.frame(kgroup = sample(1:5, size = dim(AQI.ACS)[1], replace = T))
AQI.ACS.onlynec <- bind_cols(AQI.ACS,data.frame(kgroup = sample(1:5, size = dim(AQI.ACS)[1], replace = T))) %>% 
                    select(-pm2.5_p50, -pm2.5_pct_ungreen, -pm2.5_flag20_ungreen, -denominator) %>%
                    group_by()

# Loop through the 5 groups
for (i in 1:5) {

    train <- filter(AQI.ACS.onlynec, kgroup != i)
    test  <- filter(AQI.ACS.onlynec, kgroup == i)
    
  # GLM: All variables
    glm <- train %>% glm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , family = binomial())
    
    glm.predict <- bind_cols(test[,"GEOID"], data.frame(glmpred = predict(glm, test, type = "response")))
    
  # GLM: Selected variables
    select.glm <- train %>% glm(pm2.5_flag10_ungreen ~ 
               age.median + race.black.any + ethn.nhisp.white.alone + veteran 
               + manufacturing + farm.fish.mining + worked.12mo + commutes.car.alone
             , data = .
             , family = binomial())
    
    select.predict <- bind_cols(test[,"GEOID"], data.frame(selectpred = predict(select.glm, test, type = "response")))
    
    # Random forest
    rf <- train %>% randomForest(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , ntree = 2000)
    
    rf.predict <- bind_cols(test[,"GEOID"], data.frame(rfpred = predict(rf, test, type = "prob")[,2]))
    
    # Support vector machines with radial kernel
    svm <- train %>% svm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , scale = TRUE
             , kernel = "radial"
             , probability = TRUE)
    
    svm.step1 <- predict(svm, test, probability = TRUE)

    svm.predict <- bind_cols(test[,"GEOID"], data.frame(svmpred = attr(svm.step1, "probabilities")[,2]))

    # Join the predictions together by GEOID
    two.predict   <- inner_join(glm.predict, rf.predict, by="GEOID")
    three.predict <- inner_join(two.predict, select.predict, by="GEOID")
    four.predict  <- inner_join(three.predict, svm.predict, by="GEOID")
        
    if (i==1){
      collect.predict <- four.predict
    } else {
      collect.predict <- bind_rows(collect.predict, four.predict)
    }
}

AQI.ACS.preds <- inner_join(AQI.ACS.onlynec, collect.predict, by = "GEOID")
# Calculate AUROC values for each method under 5-fold cross-validation
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, rfpred, ci=TRUE)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, glmpred, ci=TRUE)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, selectpred, ci=TRUE)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, svmpred, ci=TRUE)

# Plot ROC curves for each method
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$svmpred, col="yellow")
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$rfpred, col="darkgreen", add=TRUE)
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$glmpred, col="blue", add=TRUE)
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$selectpred, col="orange", add=TRUE)
legend("bottomright"
       ,legend = c("SVM (Radial)","Random Forest", "Logistic: All Vars", "Logistic: Select Vars")
       ,col = c("yellow","darkgreen", "blue", "orange")
       ,lwd = 1)

## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = rfpred,     ci = TRUE)
## 
## Data: rfpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.6966
## 95% CI: 0.6618-0.7314 (DeLong)
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = glmpred,     ci = TRUE)
## 
## Data: glmpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.7062
## 95% CI: 0.6719-0.7405 (DeLong)
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = selectpred,     ci = TRUE)
## 
## Data: selectpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.7157
## 95% CI: 0.6821-0.7494 (DeLong)
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = svmpred,     ci = TRUE)
## 
## Data: svmpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.706
## 95% CI: 0.6713-0.7407 (DeLong)

These 4 approaches performed very similarly, achieving AUCs of 0.70-0.72, with linear regression (top 8) slightly outperforming the others.

The community characteristics identified from the American Community Survey did show some interesting relationships with PM2.5 levels, as listed above. Taken together, though, their performance at predicting whether a tract will have worse air quality is fair to poor. Interestingly, while the racial makeup of tracts showed a clear relationship with air quality, variables that are often used to explain away differences by race - such as income, poverty, and educational attainment - turned out to be unimportant in the models. In future work, it would be interesting to explicitly consider population density and proximity to major highways, and/or proximity to a known set of industrial sites, to probe the extent to which the expected causal link to major pollutants is evident from the data.